Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream#769
WenqingLan1 wants to merge 33 commits into
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #769 +/- ##
=======================================
Coverage 85.70% 85.71%
=======================================
Files 103 103
Lines 7907 7908 +1
=======================================
+ Hits 6777 6778 +1
Misses 1130 1130
Pull request overview
Updates the GPU STREAM microbenchmark to support runtime-selectable FP32/FP64 execution and improve GPU memory bandwidth utilization, while aligning SuperBench integration (CLI, output tags, docs, and tests) to the new behavior.
Changes:
- Add `--data_type <float|double>` to select FP32/FP64 at runtime and propagate it through the Python benchmark wrapper + unit tests.
- Refactor CUDA kernels to use 128-bit vectorized accesses (`double2`/`float4`) and move template kernel implementations into a header for cross-TU instantiation.
- Adjust execution/output to single visible GPU (device 0 via `CUDA_VISIBLE_DEVICES`) and update metric/tag formats (removing `gpu_id`) plus docs/examples/test log.
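To make the 128-bit vectorization idea concrete, here is a minimal host-side sketch of a scalar-to-vector type mapping. It is not the PR's code: `Float4`, `Double2`, `VecType`, and `CopyVectorized` are hypothetical names, and the two structs are plain C++ stand-ins for CUDA's built-in `float4` and `double2` so that the example compiles without a CUDA toolchain. In a real kernel, each loop iteration below would correspond to one GPU thread moving 16 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-ins for CUDA's float4 and double2.
struct alignas(16) Float4 { float x, y, z, w; };
struct alignas(16) Double2 { double x, y; };

// Hypothetical trait: maps a scalar type to a 16-byte vector type.
template <typename T> struct VecType;
template <> struct VecType<float> {
    using type = Float4;
    static constexpr std::size_t kLanes = 4;  // 4 x 4 bytes = 16 bytes
};
template <> struct VecType<double> {
    using type = Double2;
    static constexpr std::size_t kLanes = 2;  // 2 x 8 bytes = 16 bytes
};

static_assert(sizeof(VecType<float>::type) == 16, "float vector must be 128 bits");
static_assert(sizeof(VecType<double>::type) == 16, "double vector must be 128 bits");

// STREAM-style "copy" that moves one 16-byte vector per iteration.
// Assumes n_scalars is a multiple of kLanes; trailing scalars are not copied.
template <typename T>
void CopyVectorized(const T* src, T* dst, std::size_t n_scalars) {
    using Vec = typename VecType<T>::type;
    constexpr std::size_t lanes = VecType<T>::kLanes;
    const std::size_t n_vecs = n_scalars / lanes;
    for (std::size_t i = 0; i < n_vecs; ++i) {
        Vec v;
        // memcpy keeps the host sketch free of aliasing/alignment assumptions.
        std::memcpy(&v, src + i * lanes, sizeof(Vec));
        std::memcpy(dst + i * lanes, &v, sizeof(Vec));
    }
}
```

One template serves both data types because the trait picks the vector width, which is one plausible reason for moving templated kernels into a header.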
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `tests/data/gpu_stream.log` | Updates golden log output to include data type and new tag format (no `gpu_id`). |
| `tests/benchmarks/micro_benchmarks/test_gpu_stream.py` | Extends command-generation assertions to include `--data_type` (currently only covers `double`). |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp` | Removes NUMA/GPU iteration fields from args and adds `Opts::data_type`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp` | Adds CLI parsing/printing for `--data_type`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp` | New entry point replacing the previous main file. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp` | Introduces vector-type mapping and templated kernel definitions (128-bit loads/stores). |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu` | Keeps a CUDA compilation unit and moves template implementations to the header. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp` | Expands bench-args variant to support `float` and `double`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu` | Uses local NUMA allocation, enforces 16B/thread sizing, launches templated vectorized kernels, updates tag format, and runs only CUDA device 0. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt` | Switches target sources to the new `gpu_stream_main.cpp`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream.py` | Adds `--data_type` argument and forwards it to the binary. |
| `examples/benchmarks/gpu_stream.py` | Updates example invocation to include `--data_type double`. |
| `docs/user-tutorial/benchmarks/micro-benchmarks.md` | Updates gpu-stream metric patterns to include `(double |
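The "16B/thread sizing" mentioned for `gpu_stream.cu` can be illustrated with a small sizing check. This is only a sketch under assumptions: the helper names `ThreadsForBuffer` and `BlocksForBuffer` are hypothetical, and the PR may validate or round sizes differently. The idea shown is that if every thread moves exactly one 16-byte vector, the buffer size must be a multiple of 16 bytes, and the launch grid follows from a ceiling division.

```cpp
#include <cassert>
#include <cstdint>

// Each thread handles one 128-bit (16-byte) vector.
constexpr std::uint64_t kBytesPerThread = 16;

// Hypothetical helper: number of threads needed for a buffer,
// or 0 if the size is empty or not a multiple of 16 bytes.
inline std::uint64_t ThreadsForBuffer(std::uint64_t size_bytes) {
    if (size_bytes == 0 || size_bytes % kBytesPerThread != 0) {
        return 0;
    }
    return size_bytes / kBytesPerThread;
}

// Hypothetical helper: number of thread blocks, rounding up so that
// a partially filled last block still covers the tail of the buffer.
inline std::uint64_t BlocksForBuffer(std::uint64_t size_bytes, std::uint64_t threads_per_block) {
    const std::uint64_t threads = ThreadsForBuffer(size_bytes);
    if (threads == 0 || threads_per_block == 0) {
        return 0;
    }
    return (threads + threads_per_block - 1) / threads_per_block;
}
```

With this scheme the same byte size yields the same thread count for `float` and `double`, since the per-thread work is defined in bytes rather than in elements.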
Copilot reviewed 12 out of 14 changed files in this pull request and generated 2 comments.
Copilot reviewed 11 out of 14 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:90
- `ParseOpts` initializes `size_specified` to `true`, which means omitting `--size` will not trigger the missing-required-arg check, unlike the pattern used in other micro-benchmarks (e.g., cpu_copy/gpu_copy). If `--size` is intended to be required per `PrintUsage`, `size_specified` should start as `false` (and only be set `true` when parsed).
{"check_data", no_argument, nullptr, static_cast<int>(OptIdx::kEnableCheckData)},
{"data_type", required_argument, nullptr, static_cast<int>(OptIdx::kDataType)}};
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true;
bool num_warm_up_specified = false;
bool num_loops_specified = false;
Copilot reviewed 11 out of 14 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99
`size_specified` is initialized to `true`, which makes the "required option" validation (`if (!size_specified || ...)`) ineffective for `--size` and inconsistent with `PrintUsage()` (which shows `--size` as required). Either initialize `size_specified` to `false` (to enforce `--size`) or remove `size_specified` from the required-check and update the usage text accordingly.
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true;
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
while (true) {
getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
if (getopt_ret == -1) {
if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
parse_err = true;
}
break;
Copilot reviewed 11 out of 14 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:89
`size_specified` is initialized to `true`, so the required-argument validation (`if (!size_specified || ...)`) can never fail due to a missing `--size`. This is inconsistent with other micro-benchmark option parsers in the repo (which start this flag as `false`) and with `PrintUsage()` indicating `--size` is required. Initialize `size_specified` to `false` (or remove the flag entirely if `--size` is intended to be optional) so missing/invalid size handling is unambiguous.
{"data_type", required_argument, nullptr, static_cast<int>(OptIdx::kDataType)}};
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true;
bool num_warm_up_specified = false;
Copilot reviewed 11 out of 14 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:92
`ParseOpts` initializes `size_specified` to `true`, which makes `--size` effectively optional even though the usage string lists it as required and other micro-benchmarks start this flag as `false`. This is likely unintended and makes the required-argument validation inconsistent. Initialize `size_specified` to `false` (or update the required-argument logic/usage text if `--size` is intentionally optional due to the default).
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true;
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
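The review comments above all describe the same fix, which can be sketched as follows. This is an illustrative parser rather than the PR's `ParseOpts`: `DemoOpts` and `ParseDemoOpts` are hypothetical names, only four options are handled, and the default data type is an assumption. The point it shows is that every "specified" flag starts as `false` and is set only when the option is actually parsed, so a missing `--size` is reported.

```cpp
#include <getopt.h>

#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical option struct for this sketch only.
struct DemoOpts {
    unsigned long long size = 0;
    int num_warm_up = 0;
    int num_loops = 0;
    std::string data_type = "double";  // assumed default
};

// Returns true only if all required options were seen and valid.
bool ParseDemoOpts(int argc, char** argv, DemoOpts* opts) {
    enum { kSize = 1, kNumWarmUp, kNumLoops, kDataType };
    static const option options[] = {
        {"size", required_argument, nullptr, kSize},
        {"num_warm_up", required_argument, nullptr, kNumWarmUp},
        {"num_loops", required_argument, nullptr, kNumLoops},
        {"data_type", required_argument, nullptr, kDataType},
        {nullptr, 0, nullptr, 0}};  // getopt_long needs a zeroed terminator

    // Required flags start as false and flip only when parsed.
    bool size_specified = false;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    optind = 1;  // restart scanning so the function can be called repeatedly
    while (true) {
        const int ret = getopt_long(argc, argv, "", options, nullptr);
        if (ret == -1) {
            break;  // no more options
        }
        switch (ret) {
        case kSize:
            opts->size = std::strtoull(optarg, nullptr, 10);
            size_specified = true;
            break;
        case kNumWarmUp:
            opts->num_warm_up = std::atoi(optarg);
            num_warm_up_specified = true;
            break;
        case kNumLoops:
            opts->num_loops = std::atoi(optarg);
            num_loops_specified = true;
            break;
        case kDataType:
            opts->data_type = optarg;
            if (opts->data_type != "float" && opts->data_type != "double") {
                return false;  // unsupported data type
            }
            break;
        default:
            return false;  // unknown option or missing argument
        }
    }
    return size_specified && num_warm_up_specified && num_loops_specified;
}
```

If `--size` is meant to be optional because of a default, the alternative raised in the review is to drop `size_specified` from the final check and update the usage text instead.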
Refinements:
Note: the metric tag no longer includes gpu_idx, and execution is now per-process, so users need to update their configs and rules.
New config:
New rule:
Example results:
Processed by rules: